SQL Server 2008 : SSIS Architecture and Concepts

11/23/2010 2:50:36 PM

You can think of SSIS as a data import/export/transformation layer in the overall system architecture that you are deploying for at least most of your Microsoft-based applications and a few non-Microsoft applications (see Figure 1 ). SSIS allows you to “data enable” almost all the individual applications or systems that are part of your overall implementation, such as OLTP databases, multidimensional cubes, OLAP data warehouses, Excel files, Access databases, flat files, other heterogeneous database sources, and even web services. The Integration Services object model includes both native and managed APIs for doing most SSIS work. This includes APIs for any of the SSIS tools, the command-line utilities, and even custom applications. SSIS Designer and the Integration Services Wizard both use the Integration Services object model. SSIS includes the integration service itself (that is, the service that manages all SSIS packages), the Integration Services object model, the SSIS runtime and runtime executables, and the data flow task (which has a data flow engine, source, transformation, and destination components).

Figure 1. SSIS architecture.

Microsoft uses SSIS packages to implement any data movement/transformation. Basically, Microsoft treats SSIS packages as if they are managed code and requires that you create Integration Services projects and deployment utilities as part of managing these SSIS packages. In addition, separate Integration Services Connection projects can now be created to aid in connection and provider services. All in all, this is a very good approach that significantly reduces errors and allows you to go through a reasonably formal release to production (that is, development and deployment) cycle.

SSIS packages contain a collection of connections, control flow elements, data flow elements, event handlers, variables, and configurations. They take the form of tasks, containers, transformations, and workflows. SSIS packages go through one or more steps that are either executed sequentially or in parallel at package execution time. In a nutshell, when an SSIS package is executed, it does the following:

1.	Connects to any identified data source
2.	Copies data (and database objects, if needed)
3.	Transforms data
4.	Disconnects from the data sources
5.	Notifies users, processes, and even other packages of events (such as sending an email when something is done or has errors)

The basic SSIS package consists of the following:

SSIS packages— A package is a discrete, named collection of connections, control flow, and data flows that implement data movement/data transformation.
SSIS control flow and tasks— One or more tasks and containers drive what the package does. You organize control flow based on what you want the package to do. Tasks are the actions taken to accomplish the desired data transformation and movement. A task can execute any SQL statement, send mail, bulk insert data, execute an ActiveX script, run a Visual Studio Tool for Application script (VSTA), or launch another package or an external program.
SSIS containers— A container groups one or more related tasks that you want to manage together (and reuse together).
Workflows— Workflows are definable precedence constraints that allow you to link two tasks, based on whether the first task executes, executes successfully, or executes unsuccessfully. Workflow containers are the wrappers for the tasks and are the means for the flow of control. A task can run alone, parallel to another task, or sequentially, according to precedence constraints. Precedence constraints are of three types:
- Unconditional—It does not matter whether the preceding step failed or succeeded.
- On success—The preceding step must have been successful for the execution of the next step.
- On failure—This constraint returns the appropriate error.
SSIS data flow— The data flow identifies the sources and destinations that extract and load data; identifies the transformations that manipulate or enhance the data; and provides the paths that link sources, transformations, and destinations.
SSIS data flow task— A data flow task creates, orders, and runs the data flows themselves, using a data flow engine.
SSIS transformations— Transformations are one or more functions or operations applied against a piece of data before the data arrives at the destination.

In SSIS, everything is pretty much a task or a collection of tasks (one or more containers, tasks in containers), as you can see in Figure 2 . Control flow determines the overall execution of the package and data flows that access the data, transform it, and write it. Precedence constraints determine the overall control flow—connecting the executables, containers, and tasks into an ordered control flow.

Figure 2. SSIS package elements.

SSIS also has several objects that extend package functionality:

SSIS event handlers— These workflow tasks run in response to events raised by a package, task, or container. This is much the same as most programming languages, such as Java or C#. If a task (or package or container) has some issue (that is, raises an event), the event handler can be used to handle the issue appropriately. Typical events in data transformation processing that need to be handled with event handlers might include connections not being established, disk space issues, and so on. You can even have the event handlers write out emails or initiate other workflows.
SSIS configurations— These objects are used to help parameterize many of the previously hard-bound characteristics of packages at runtime. When a package is run, the configuration information is loaded (updating the values of the package’s properties), and then the package is run using the new configuration values (all without having to modify the package). SSIS configurations use the classic property/value pair paradigm to represent the properties that are to be configurable. Following are the varied methods of representing configuration files:
- XML configuration file—This file identifies the configuration property/value pairs for any number of configuration values. The following sample XML configuration file is for a package named UnleashedPackage with a property of PKGVar:
```
<?xml version="1.0"?>
<DTSConfiguration>
   <DTSConfigurationHeading>
      <DTSConfigurationFileInfo
          GeneratedBy="DatabaseArchitechs\PBertucci"
          GeneratedFromPackageName="UnleashedPackage"
          GeneratedFromPackageID="{3GV09721-816B-4E28-9878-0DE37A150234}"
          GeneratedDate="7/09/2009 7:12:22 AM"/>
   </DTSConfigurationHeading>
   <Configuration ConfiguredType="Property"
Path="\Package.Variables[User::PKGVar].Value"
ValueType="Int32">
      <ConfiguredValue>0</ConfiguredValue>
   </Configuration>
</DTSConfiguration>
```
  A configuration header contains information about the configuration file. This element includes attributes such as when the file was created and the name of the person who generated the file. In addition, a configuration element contains information about each configuration. This element includes attributes such as the property path and configured value of a property.
- Configuration table in SQL Server—This table stores configuration entries for use by the packages.
- Environment variables (VARs)—These can be referenced by the package.
- Parent package VARs—These can be used by child packages.
- Entry in Registry—The Registry can also contain the configuration values.
SSIS Logging— Logging can be done from any task or package to write out any type of logging information desired. When a supplied logging provider is used, a package can provide a rich runtime history. Logs are associated with packages (that is, the reference point), but any task (or container) can write to any package’s log. In this way, it is possible to have consolidated logs of a driver package with the full execution history of all child packages. The log providers (out of the box) write to a flat file (text file) or to SQL Server tables. Other custom logging providers can be used, though. You can log what you need to log—start date/time, end date/time, records transformed, errors, and so on.
SSIS variables— SSIS has both system variables and user-defined variables. System variables provide runtime package object information to tasks or other packages. This information is helpful when you want to reference these system variables to help decide what to do next. (They can be used in expressions, scripts, and configurations.) User-defined variables are really for specialized variables that are not found as system variables and only have to be used within a package’s scope. Again, these variables can be used in expressions, scripts, and configurations within a package.

SSIS packages can run other packages. This capability is very helpful when you want to granularly break out common data transformations for reuse by many different higher-level solutions (that is, higher-level packages that execute common-detail-level transformation packages).

Note

When an SSIS package is first created, it is given a globally unique identifier (GUID) that is added to the package’s ID property and a name that is added to its NAME property. After these are created, they become part of the reference mechanism for the package itself. If you ever copy a package as the basis of a new package, you have to rename these two properties so they are unique (that is, new GUID and new NAME property). If you simply want to give an existing package a new NAME or ID value, you can do so directly or with the dtutil command-line utility.

You can also create packages that can be restarted at a point of failure, including restarting specific tasks within a package (and not all the tasks in a package). This is a super addition to SQL Server 2008. If a package had more than one data flow task and one completed but the others didn’t, you could restart just the data flow tasks that had not completed without rerunning the ones that had worked fine. Long-running packages can also create checkpoints to provide milestones from which to restart. This capability will save many sleepless nights for the folks doing production support for data transformation processing.